KD-Zero: Evolving Knowledge Distiller for Any Teacher-Student Pairs

Neural Information Processing Systems

Knowledge distillation (KD) has emerged as an effective model-compression technique that can enhance lightweight models. Conventional KD methods propose various designs to allow the student model to better imitate the teacher.
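For context, the sketch below shows the classic logit-matching KD loss that conventional methods build on; it is not KD-Zero's evolved objective, and the `temperature` and `alpha` hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Classic KD loss: a soft KL term against the teacher's logits
    plus a hard cross-entropy term against the ground-truth labels."""
    # Soften both distributions with the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The KL term is scaled by T^2 to keep gradient magnitudes comparable.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term
```

In practice, the student is trained by backpropagating this combined loss while the teacher's parameters stay frozen.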



Appendix A: Control algorithm. The action-value function can be decomposed into two components as $Q^{(PT)}(s, a) = Q^{(P)}(s, a) + Q^{(T)}(s, a)$.

Neural Information Processing Systems

We use induction to prove this statement; the penultimate step follows from the induction hypothesis, completing the proof. The fixed point of Eq. (5) is then the value function of $\pi$ in $\mathcal{M}$. We focus on the permanent value function in the next two theorems. The permanent value function is updated using Eq.
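As a rough illustration of the decomposition above (not the paper's algorithm), the combined action-value can be kept as the sum of a slowly consolidated permanent table and a fast, resettable transient table; the class name, step sizes, and consolidation rule here are all assumptions made for the sketch.

```python
import numpy as np

class DecomposedQ:
    """Illustrative permanent/transient decomposition:
    Q_PT(s, a) = Q_P(s, a) + Q_T(s, a)."""

    def __init__(self, n_states, n_actions):
        self.q_perm = np.zeros((n_states, n_actions))   # Q^(P): slow, persistent
        self.q_trans = np.zeros((n_states, n_actions))  # Q^(T): fast, resettable

    def q(self, s, a):
        # The combined estimate is the sum of the two components.
        return self.q_perm[s, a] + self.q_trans[s, a]

    def td_update(self, s, a, r, s_next, gamma=0.99, alpha=0.5):
        # Q-learning-style update applied to the transient component,
        # with the bootstrap target computed from the combined estimate.
        best_next = (self.q_perm[s_next] + self.q_trans[s_next]).max()
        target = r + gamma * best_next
        self.q_trans[s, a] += alpha * (target - self.q(s, a))

    def consolidate(self, beta=0.1):
        # At a task boundary, fold transient knowledge into the permanent
        # component and reset the transient table (illustrative rule).
        self.q_perm += beta * self.q_trans
        self.q_trans[:] = 0.0
```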



Scaling Laws for Hyperparameter Optimization

Neural Information Processing Systems

Hyperparameter optimization is an important subfield of machine learning that focuses on tuning the hyperparameters of a chosen algorithm to achieve peak performance. Recently, a stream of methods has tackled hyperparameter optimization; however, most do not exploit the dominant power-law nature of learning curves for Bayesian optimization. In this work, we propose Deep Power Laws (DPL), an ensemble of neural network models conditioned to yield predictions that follow a power-law scaling pattern. Our method dynamically decides which configurations to pause and which to train incrementally by making use of gray-box evaluations.
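As a minimal sketch of the power-law assumption that DPL exploits (not the authors' neural ensemble), one can fit a saturating power law to the observed prefix of a learning curve and extrapolate it to a larger training budget; the functional form, synthetic data, and parameter names are assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(t, y_inf, a, b):
    # Saturating power law: performance approaches y_inf as t grows.
    return y_inf - a * np.power(t, -b)

rng = np.random.default_rng(0)
# Partially observed validation accuracy for one configuration (synthetic).
epochs = np.arange(1, 11).astype(float)
accuracy = 0.9 - 0.5 * epochs ** -0.7 + rng.normal(0.0, 0.005, epochs.size)

# Fit on the observed prefix, then extrapolate to a larger budget.
params, _ = curve_fit(power_law, epochs, accuracy,
                      p0=[1.0, 0.5, 0.5], maxfev=5000)
print(f"extrapolated accuracy at epoch 100: {power_law(100.0, *params):.3f}")
```

A gray-box optimizer can use such extrapolations to pause configurations whose predicted final performance is unpromising and spend the remaining budget training the promising ones.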